Syntactic N-gram Collection from a Large-Scale Corpus of Internet Finnish
نویسندگان
چکیده
In this paper, we report on the development of a large-scale Finnish Internet parsebank, currently consisting of 1.5 billion tokens in 116 million sentences. The data is fully morphologically and syntactically analyzed and it has been used to extract flat and syntactic n-gram collections, as well as verb-argument and nounargument n-grams. Additionally, distributional vector space representations of the words are induced using the word2vec method. All n-gram collections as well as the vector space models are made available under an open license.
منابع مشابه
Towards the Classification of the Finnish Internet Parsebank: Detecting Translations and Informality
This paper presents the first results on detecting informality, machine and human translations in the Finnish Internet Parsebank, a project developing a large-scale, web-based corpus with full morphological and syntactic analyses. The paper aims at classifying the Parsebank according to these criteria, as well as studying the linguistic characteristics of the classes. The features used include ...
متن کاملWeb-scale Surface and Syntactic n-gram Features for Dependency Parsing
We develop novel firstand second-order features for dependency parsing based on the Google Syntactic Ngrams corpus, a collection of subtree counts of parsed sentences from scanned books. We also extend previous work on surface n-gram features from Web1T to the Google Books corpus and from first-order to second-order, comparing and analysing performance over newswire and web treebanks. Surface a...
متن کاملBuilding a Large Automatically Parsed Corpus of Finnish
We describe the methods and resources used to build FinnTreeBank-3, a 76.4 million token corpus of Finnish with automatically produced morphological and dependency syntax analyses. Starting from a definition of the target dependency scheme, we show how existing resources are transformed to conform to this definition and subsequently used to develop a parsing pipeline capable of processing a lar...
متن کاملSpecifying Treebanks, Outsourcing Parsebanks: FinnTreeBank 3
Corpus-based treebank annotation is known to result in incomplete coverage of midand low-frequency linguistic constructions: the linguistic representation and corpus annotation quality are sometimes suboptimal. Large descriptive grammars cover also many midand low-frequency constructions. We argue for use of large descriptive grammars and their sample sentences as a basis for specifying higher-...
متن کاملN-gram Counts and Language Models from the Common Crawl
We contribute 5-gram counts and language models trained on the Common Crawl corpus, a collection over 9 billion web pages. This release improves upon the Google n-gram counts in two key ways: the inclusion of low-count entries and deduplication to reduce boilerplate. By preserving singletons, we were able to use Kneser-Ney smoothing to build large language models. This paper describes how the c...
متن کامل